Homework 2¶

ECS 271¶

Kay Royo¶

1.)¶

Implement a single-layer fully connected neural network model to classify MNIST images. It takes a raw image as a 1D vector of length 784 = 28x28 and outputs a vector of length 10 (each dimension corresponds to a class). You may want to add a bias term to the input, but that is optional for this assignment. The output is connected to the input simply by a set of linear weights. The output layer uses the Softmax function as the non-linearity. Softmax is a simple function that converts an n-dimensional input (z) to an n-dimensional output (o) where the output sums to one and each element is a value between 0 and 1. It is defined as

$$o_i = \frac{exp(z_i)}{\sum_{j=1}^n exp(z_j)}$$

When we apply this function to the output of the network, $o$, it predicts a vector which can be seen as the probability of each category given the input $x$:

$$P(c_i|x) = \frac{exp(o_i)}{\sum^n_{j=1} exp(o_j )}$$

where $n$ is the number of categories, 10, in our case. We want the $i$’th output to mimic $P(c_i|x)$, the probability of the input $x$ belonging to the category $i$. We can represent the desired probability distribution as the vector $gt$ where $gt(i)$ is one only if the input is from the $i$’th category and zero otherwise. This is called one-hot encoding. Assuming $x$ is from $y$’th category, $gt(y)$ is the only element in $gt$ that is equal to one. Then, we want the output probability distribution to be similar to the desired one (ground-truth). Hence, we use cross-entropy loss to compare these two probability distributions, $P$ and $gt$:

$$L(x,y,w) = \sum_{i=1}^n - gt(i) log(P(c_i|x))$$

where $n$ is the number of categories. Since $gt$ is a one-hot encoding, we can drop the terms where $gt(i)$ is zero, keeping only the $y$'th term. Since $gt(y) = 1$, we can omit it from the product, giving the following loss, identical to the one above:

$$L(x,y,w) = - log(P(c_y|x))$$

This is the loss for one input only, so the total loss on a mini-batch is:

$$L = \sum_{k=1}^N -log(P(c_{y_k}|x_k)) $$

where $N$ is the mini-batch size, the number of training examples processed at this iteration.

Please implement the stochastic gradient descent (SGD) algorithm from scratch to train the model. You may use NumPy, but should not use PyTorch, TensorFlow, or any similar deep learning framework. Use a mini-batch of 10 images per iteration. Then, train it on the full MNIST training set and plot the accuracy on all test data for every n iterations. Please choose n small enough so that the graph shows the progress of learning and large enough so that testing does not take a lot of time. You may use a smaller n initially and then increase it gradually as the learning progresses. Choose a learning rate so that the loss goes down.

Answer:

Using training data with $60{,}000$ instances and a mini-batch of size $10$ per iteration means that one epoch consists of $\frac{\text{instances}}{\text{batch size}} = \frac{60{,}000}{10} = 6{,}000$ batches or iterations.

Since this is a multi-class classification problem, we want to minimize the cross-entropy loss function: $$L(Y, \hat{Y})= - \sum_{i=1}^n y_i \cdot log(\hat{y_i}) $$

where $n= 10$ is the number of categories, $y_i$ are the entries of the ground-truth label $Y$ (one-hot encoded), and $\hat{y_i}$ are the entries of the prediction vector $\hat{Y}$.

If we average the loss over a mini-batch of $m$ training samples, we have

$$L(Y, \hat{Y})= - \frac{1}{m} \sum_{j=1}^m \sum_{i=1}^n y_i^{(j)} \cdot log\big(\hat{y}_i^{(j)}\big)$$

where the superscript $(j)$ indexes the samples in the batch.
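As a quick sanity check (toy numbers I made up, not MNIST), the double-sum form of the batch loss agrees with simply averaging $-log$ of each sample's true-class probability:

```python
import numpy as np

# toy batch: 3 classes, m = 2 samples (columns), rows are classes
Y = np.array([[1., 0.],
              [0., 0.],
              [0., 1.]])           # sample 1 is class 0, sample 2 is class 2
Y_hat = np.array([[0.7, 0.2],
                  [0.2, 0.2],
                  [0.1, 0.6]])     # predicted probabilities (columns sum to 1)

m = Y.shape[1]
full = -np.sum(Y * np.log(Y_hat)) / m            # double-sum formula
short = -np.mean(np.log(Y_hat[[0, 2], [0, 1]]))  # -log of the true-class entries only
print(np.isclose(full, short))
```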

The prediction vector $\hat{Y}$ can be expressed as a vector of probabilities that sums to 1. These probabilities are obtained using the softmax function:

$$s_i = \frac{e^{z_i}}{\sum_{k=1}^{n} e^{z_k}}$$

where the vector $z$ is the input to the softmax and $n = 10$.
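A minimal numerical sketch of this function (the helper name `softmax_stable` and the max-shift are my additions; shifting by the max leaves the output unchanged but avoids overflow for large inputs):

```python
import numpy as np

def softmax_stable(z):
    # exp(z - c) / sum(exp(z - c)) equals exp(z) / sum(exp(z)) for any constant c,
    # so shifting by max(z) changes nothing but keeps exp() from overflowing
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

s = softmax_stable(np.array([2.0, 1.0, 0.1]))
print(s, s.sum())                                  # entries in (0, 1), summing to 1
print(softmax_stable(np.array([1000.0, 1000.0])))  # no overflow: [0.5 0.5]
```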

Forward propagation:

$z = w^TX +b $

Applying the softmax non-linearity $\sigma$ to $z$ gives the predicted probabilities:

$s = \sigma (z) \ , \quad s_i = \frac{e^{z_i}}{\sum_{k=1}^{n} e^{z_k}} \ \ \ \ (1) $

We then use backpropagation to adjust the weights and biases based on the computed loss so that the loss is lower at the next iteration.

Before implementing the backpropagation step, we need the gradient of the loss with respect to the weighted input $z$ of the output layer, derived as follows. We let $\hat{y_i} = s_i$, the softmax output

$\frac{\partial L}{\partial z_j} = - \frac{\partial }{\partial z_j} \sum_{i=1}^n y_i \cdot log(s_i) $

$ \ \ \ \ \ \ = - \sum_{i=1}^n y_i \cdot \frac{\partial }{\partial z_j} log(s_i) \ , \ $ derive $\frac{\partial}{\partial z_j} log(s_i)$ below

$ \ \ \ \ \ \ = - \sum_{i=1}^n y_i \cdot \frac{\partial }{\partial z_j} log\Big(\frac{e^{z_i}}{\sum_{k=1}^n e^{z_k}} \Big) $

$ \ \ \ \ \ \ = - \sum_{i=1}^n \frac{y_i}{s_i} \cdot \frac{\partial s_i}{\partial z_j} $

$ \ \ \ \ \ \ = - \sum_{i=1}^n \frac{y_i}{s_i} \cdot s_i \Big[ \mathbb{1} \{i=j\} - s_j \Big] $

$ \ \ \ \ \ \ = - \sum_{i=1}^n {y_i} \Big[ \mathbb{1} \{i=j\} - s_j \Big] $

$ \ \ \ \ \ \ = - \sum_{i=1}^n {y_i} \cdot \mathbb{1} \{i=j\} + \sum_{i=1}^n {y_i} \cdot s_j $

$ \ \ \ \ \ \ = - y_j + \sum_{i=1}^n {y_i} \cdot s_j $

$ \ \ \ \ \ \ = s_j \sum_{i=1}^n {y_i} - y_j \ \ , \ \ $ $\sum_{i=1}^n {y_i} = 1$ since one-hot encoded vector $Y$ sums to $1$

$ \ \ \ \ \ \ = s_j - y_j $

$ \frac{\partial L}{\partial z} = s - y $
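The clean result $\frac{\partial L}{\partial z} = s - y$ is easy to verify numerically with a central finite-difference check (toy logits that I made up, with class 3 as the true label):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=10)              # logits for one sample
y = np.zeros(10); y[3] = 1.0         # one-hot ground truth, class 3

def loss_at(z):
    return -np.log(softmax(z)[3])    # cross-entropy reduces to -log(s_y)

analytic = softmax(z) - y            # the derived gradient s - y

eps = 1e-6
numeric = np.array([
    (loss_at(z + eps * np.eye(10)[j]) - loss_at(z - eps * np.eye(10)[j])) / (2 * eps)
    for j in range(10)
])
print(np.max(np.abs(analytic - numeric)))  # agreement up to finite-difference error
```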

$ \ \ \ \ \ \ $

$\frac{\partial z}{\partial w_j} = \frac{\partial }{\partial w_j}\big(w^TX +b\big)$

$ \ \ \ \ \ \ \ = \frac{\partial }{\partial w_j}(w_0x_0 + \cdots + w_nx_n +b)$

$ \ \ \ \ \ \ \ = x_j$

$ \ \ \ \ \ \ $

$\frac{\partial s_i}{\partial z_j} = \frac{\partial }{\partial z_j} \frac{e^{z_i}}{\sum_{k=1}^n e^{z_k}}$

For $i = j$, the quotient rule gives

$ \ \ \ \ \ = \frac{e^{z_i} \sum_{k=1}^n e^{z_k} - e^{z_j}e^{z_i}}{\left( \sum_{k=1}^n e^{z_k}\right)^2} $

$ \ \ \ \ \ = \frac{e^{z_i} \left( \sum_{k=1}^n e^{z_k} - e^{z_j}\right )}{\left( \sum_{k=1}^n e^{z_k}\right)^2} $

$ \ \ \ \ \ = \frac{ e^{z_i} }{\sum_{k=1}^n e^{z_k} } \times \frac{ \sum_{k=1}^n e^{z_k} - e^{z_j} }{\sum_{k=1}^n e^{z_k} }$

$ \ \ \ \ \ = s_i(1-s_j) \ \ , \ $ for $i = j$

$ \ \ \ \ \ \ $

$\frac{\partial s_i}{\partial z_j} = \frac{\partial }{\partial z_j} \frac{e^{z_i}}{\sum_{k=1}^n e^{z_k}}$

For $i \ne j$, the numerator $e^{z_i}$ does not depend on $z_j$, so

$ \ \ \ \ \ = \frac{0 - e^{z_j}e^{z_i}}{\left( \sum_{k=1}^n e^{z_k}\right)^2} $

$ \ \ \ \ \ = \frac{- e^{z_j} }{\sum_{k=1}^n e^{z_k} } \times \frac{e^{z_i} }{\sum_{k=1}^n e^{z_k} }$

$ \ \ \ \ \ = - s_j \cdot s_i \ \ , \ $ for $i \ne j$

Hence, combining both cases,

$\frac{\partial s_i}{\partial z_j} = s_i \big( \mathbb{1}\{i=j\} - s_j \big)$
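Both cases can be checked at once: in matrix form the softmax Jacobian is $\mathrm{diag}(s) - ss^T$, which a finite-difference Jacobian reproduces (toy 4-dimensional input of my choosing):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0, 0.0])
s = softmax(z)
n = len(z)

# closed form: J[i, j] = s_i * (1{i=j} - s_j)
J_analytic = np.diag(s) - np.outer(s, s)

# central finite differences, perturbing one input component at a time
eps = 1e-6
J_numeric = np.zeros((n, n))
for j in range(n):
    dz = eps * np.eye(n)[j]
    J_numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(J_analytic, J_numeric, atol=1e-8))  # True
```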

$ \ \ \ \ \ \ $

To see how $L$ changes w.r.t. each component $w_j$ of $w$, we compute $\frac{\partial L}{\partial w_j}$ using the chain rule. Since we already showed $\frac{\partial L}{\partial z} = s - y$,

$\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial z} \frac{\partial z}{\partial w_j}$

$ \ \ \ \ \ \ = (s-y) \, x_j$

In vectorized form, averaged over the mini-batch, this is

$\frac{\partial L}{\partial w} = \frac{1 }{m} (s-y) X^T \ \ , \ $ where $s =$ predicted probabilities

$ \ \ \ \ \ \ $

Similarly,

$\frac{\partial z}{\partial b} = \frac{\partial }{\partial b}\big(w^TX +b\big)$

$ \ \ \ \ \ \ \ = \frac{\partial }{\partial b}(w_0x_0 + \cdots + w_nx_n +b)$

$ \ \ \ \ \ \ \ = 1$

Thus,

$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z} \frac{\partial z}{\partial b}$

$ \ \ \ \ \ \ = (s-y)$

In vectorized form this is

$\frac{\partial L}{\partial b} = \frac{1 }{m} \sum_{i=1}^m(s-y)$

$ \ \ \ \ \ \ $

To adjust the weights at each iteration we use the gradient descent update rule with fixed step size $\alpha$: $w^{(k+1)} = w^{(k)} - \alpha \nabla L(w^{(k)})$

so the updated parameters for the single layer are

$w_1^{new} = w_1 - \alpha \frac{\partial L}{\partial w_1} $

$b_1^{new} = b_1 - \alpha \frac{\partial L}{\partial b_1} $

where $L$ is the average mini-batch loss.
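Putting the pieces together, here is a small sketch of one SGD step on made-up toy data (4 features, 3 classes, batch of 8; all sizes are illustrative, not the MNIST ones); the average loss drops after the update:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # features x batch, toy sizes
Y = np.eye(3)[rng.integers(0, 3, 8)].T   # one-hot labels, shape (3, 8)

w = rng.normal(size=(3, 4)) * 0.1
b = np.zeros((3, 1))
alpha, m = 0.1, X.shape[1]

def forward(w, b):
    z = w @ X + b
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)      # softmax probabilities

def avg_loss(s):
    return -np.sum(Y * np.log(s)) / m

before = avg_loss(forward(w, b))
s = forward(w, b)
dz = s - Y                                       # dL/dz from the derivation
w = w - alpha * (dz @ X.T) / m                   # w^new = w - alpha * dL/dw
b = b - alpha * dz.sum(axis=1, keepdims=True) / m
after = avg_loss(forward(w, b))
print(after < before)   # the loss decreases for this small step
```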

Reference(s): https://towardsdatascience.com/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1

In [100]:
# Import dependencies 
import pandas as pd 
import nbconvert
import numpy as np
from scipy.io import loadmat
import matplotlib.pyplot as plt
import operator 
from operator import itemgetter
import plotly.express as px
import timeit
import plotly.io as pio
pio.renderers.default='notebook'
In [101]:
#one hot encoding function 
def one_hot_enc(Y):
    t = np.zeros((Y.shape[0], 10))
    for i in range(Y.shape[0]):
        t[i][int(Y[i][0])] = 1 
    return t

#normalization function  
def normalize(X): 
    X = X / 255 #divide by 255 since each pixel-value is a grayscale integer between 0 and 255
    return X 
In [102]:
#Load dataset 
M = loadmat('../HW1/data/MNIST_digit_data.mat') #forward slashes avoid backslash-escape issues and work on all platforms

#assign labels and pre-normalized images from dictionary into individual arrays  
images_train,images_test,labels_train,labels_test= M['images_train'],M['images_test'],M['labels_train'],M['labels_test']

#set random seed 
np.random.seed(1)

#randomly permute data points
inds = np.random.permutation(images_train.shape[0])
images_train = images_train[inds]
labels_train = labels_train[inds]

inds = np.random.permutation(images_test.shape[0])
images_test = images_test[inds]
labels_test = labels_test[inds]
In [103]:
#one hot encode labels 
labels_train_enc = one_hot_enc(labels_train).astype(float)
labels_test_enc = one_hot_enc(labels_test).astype(float)

#transpose training and test arrays to avoid transposing 
#them in the following functions 
X_train = images_train.T
Y_train = labels_train_enc.T
X_test = images_test.T
Y_test = labels_test_enc.T

#display final test and train data sizes 
print(X_train.shape, Y_train.shape)
print(X_test.shape,Y_test.shape) 
(784, 60000) (10, 60000)
(784, 10000) (10, 10000)
In [5]:
#softmax function 
def softmax(z):
    """
    Computes softmax column-wise 
    
    Inputs: 
    - z: result of W^T.X+b (dimension: 10 x batch size)
    """
    z = z - np.max(z, axis=0, keepdims=True) #shift by the column max for numerical stability; leaves the result unchanged
    return np.exp(z) / np.sum(np.exp(z), axis=0)

#loss function (cross entropy loss)
def loss(Y, Y_pred): 
    """
    Computes average cross-entropy loss 
    
    Inputs: 
    - Y: ground-truth labels, one-hot encoded (10 x m)
    - Y_pred: predicted probabilities (10 x m)
    """
    
    log_sum = -np.sum(np.multiply(Y, np.log(Y_pred))) #negative log-likelihood summed over all samples
    m = Y.shape[1]
    L = (1./m) * log_sum #average over the m samples
    return L

#accuracy function 
def accuracy(labels, predictions): 
    """
    Computes test accuracy for every iteration = number of batches x epochs   
    
    Inputs: 
    - labels: actual labels (one-hot encoded)
    - predictions: predicted labels 
    """
    total_correct = 0 
    m = labels.shape[1] #number of instances contained in array of actual labels 
    predictions = np.argmax(predictions, axis=0) #extract index of largest probability which represents predicted label
    labels = np.argmax(labels, axis=0) #extract index of largest value (equals to one), which represents actual label 
    #compare indices and count how many of them are the same 
    for i in range(len(labels)):
        if (predictions[i] == labels[i]):
            total_correct += 1
    return total_correct/m*100 
In [117]:
#forward propagation function 
def forward_fc(X, params):
    """
    Computes the forward pass for an affine fully-connected layer 
    
    Inputs: 
    - X: Input training images batch (dimension: 784 x batch size)
    - params: Weights (dimension: 10x784) and Bias (dimension: 10x1)
    """
    
    dict_f={} #initialize empty dictionary for forward results 
    
    # input layer to l1: Z1 = w1^T*x + b1 
    dict_f['z1'] = np.matmul(params['w1'],X) + params['b1'] #output 
    dict_f['l1'] = softmax(dict_f['z1']) #get probabilities of the output using softmax 
     
    return dict_f 

#backward propagation function 
def backward_fc(X,Y,params,dict_f,batch_size):
    """
    Computes the backward pass for an affine fully-connected layer 
    
    Inputs: 
    - X: Input training images batch (dimension: 784 x batch size)
    - Y: Input training labels batch (dimension: 10 x batch size)
    - params: Weights (dimension: 10x784) and Bias (dimension: 10x1)
    - dict_f: output of forward propagation process 
    - batch_size: size of every batch 
    """
    
    #initialize empty dictionary for results 
    dict_b = {}
    
    #compute derivatives of the loss wrt z, w, and b of the output layer 
    dz1 = dict_f['l1'] - Y 
    dict_b['dw1'] = (1./batch_size) * np.matmul(dz1, X.T)
    dict_b['db1'] = (1./batch_size) * np.sum(dz1, axis=1, keepdims=True) #bias gradient 

    return dict_b
In [20]:
#single-layer fully connected network with mini batch SGD 

def mini_batch_fc(X_train,Y_train, X_test, Y_test, input_size = 28*28, output_size = 10,  batch_size = 10, epoch = 3, rand_sample = True): 
    np.random.seed(11)
    
    #initialize
    m = X_train.shape[1] #number of training samples
    beta = 0.9 #exponential moving average weight (reserved for momentum; unused in the plain SGD updates below) 
    learning_rate = 0.1 
    batches = int(X_train.shape[1] / batch_size) #number of batches per epoch
    
    #initialize empty results lists
    results_per_iter = []
    results_per_epoch = []

    #initialize parameters
    #scale the weights by sqrt(1/n) to adjust their variance to 1/n
    params = { "w1": np.random.randn(output_size, input_size) * np.sqrt(1. / input_size),
               "b1": np.zeros((output_size, 1)) }



    #initialize exponential moving average of gradients as 0
    #(reserved for momentum; the updates below use plain SGD) 
    dw1_v = np.zeros(params["w1"].shape)
    db1_v = np.zeros(params["b1"].shape)
    
    #loop through each batch of samples 

    for n in range(epoch):

        print("*** Epoch {} ***".format(n))  
        
        if (rand_sample==True): 
            #randomly permute column indices                   
            indices = np.random.permutation(m)
            X = X_train[:, indices]
            Y = Y_train[:, indices]
        else: 
            X = X_train
            Y = Y_train
            
        #counter 
        count = 0 

        #iterate through each batch of 10 samples in m total training samples  
        for i in range(0, m, batch_size):

            #assign i-th batch to variables 
            X_i = X[:, i:i+batch_size]
            Y_i = Y[:, i:i+batch_size]

            #perform forward and backward pass
            dict_f = forward_fc(X_i, params)
            dict_b = backward_fc(X_i, Y_i, params, dict_f, batch_size)

            #update parameters using GD update rule with fixed learning rate  
            params["w1"] = params["w1"] - learning_rate * dict_b["dw1"]
            params["b1"] = params["b1"] - learning_rate * dict_b["db1"]

            #train 
            dict_f =forward_fc(X_train, params)
            train_loss = loss(Y_train, dict_f["l1"])

            #test 
            dict_f = forward_fc(X_test, params)
            test_loss = loss(Y_test, dict_f["l1"])
            acc = accuracy(Y_test, dict_f["l1"])

            #update counter 
            count += 1 

            #save results 
            iter_results = {'iteration': count , 'train_loss':train_loss, 'test_loss': test_loss, 'test_accuracy':acc }
            results_per_iter.append(iter_results) 

            #display results
            c = batches/6
            if (count%c == 0) : 
                print("Training {}: training loss = {}, test loss = {}, test accuracy = {} ".format(count,train_loss, test_loss, acc))

        #train 
        dict_f =forward_fc(X_train, params)
        train_loss = loss(Y_train, dict_f["l1"])

        #test 
        dict_f = forward_fc(X_test, params)
        test_loss = loss(Y_test, dict_f["l1"])
        acc = accuracy(Y_test, dict_f["l1"])

        #save results 
        e = n+1 
        results = {'epoch': e, 'train_loss':train_loss, 'test_loss': test_loss, 'test_accuracy':acc }
        results_per_epoch.append(results)

        print("Training done!")

    return results_per_iter, results_per_epoch, dict_f, dict_b, params
In [21]:
%%time 
results_per_iter, results_per_epoch, dict_f, dict_b, params = mini_batch_fc(X_train,Y_train, X_test, Y_test, input_size = 28*28, output_size = 10,  batch_size = 10, epoch = 3, rand_sample = True)
*** Epoch 0 ***
Training 1000: training loss = 0.37615839867988443, test loss = 0.3599463301678172, test accuracy = 89.96 
Training 2000: training loss = 0.3269326732542967, test loss = 0.3144352901435516, test accuracy = 91.10000000000001 
Training 3000: training loss = 0.3287269554577137, test loss = 0.3255100510059334, test accuracy = 90.67 
Training 4000: training loss = 0.3204995438744067, test loss = 0.31679168268565683, test accuracy = 90.85 
Training 5000: training loss = 0.306562028891086, test loss = 0.30231847903932185, test accuracy = 91.06 
Training 6000: training loss = 0.2983037603144213, test loss = 0.2956442065832815, test accuracy = 91.57 
Training done!
*** Epoch 1 ***
Training 1000: training loss = 0.29257699723112746, test loss = 0.2930958036275988, test accuracy = 91.8 
Training 2000: training loss = 0.30194958275611844, test loss = 0.3059924591959284, test accuracy = 91.34 
Training 3000: training loss = 0.2825303936437372, test loss = 0.28377832939147624, test accuracy = 91.74 
Training 4000: training loss = 0.28440182640761935, test loss = 0.28700242833715667, test accuracy = 91.96 
Training 5000: training loss = 0.2814790892559006, test loss = 0.2857230830952564, test accuracy = 92.05 
Training 6000: training loss = 0.2789279010135515, test loss = 0.28799861825356926, test accuracy = 92.0 
Training done!
*** Epoch 2 ***
Training 1000: training loss = 0.2805534258884522, test loss = 0.28771444650791445, test accuracy = 92.0 
Training 2000: training loss = 0.2915605419481089, test loss = 0.29457788364437537, test accuracy = 91.73 
Training 3000: training loss = 0.2760029088285485, test loss = 0.2846294907466463, test accuracy = 92.12 
Training 4000: training loss = 0.2942687443367019, test loss = 0.30076067154865577, test accuracy = 91.28 
Training 5000: training loss = 0.2756507231819271, test loss = 0.2865742773133869, test accuracy = 91.86 
Training 6000: training loss = 0.27494525432394273, test loss = 0.2845414392539408, test accuracy = 92.21000000000001 
Training done!
CPU times: total: 1h 25min 38s
Wall time: 16min 31s
In [15]:
df_iter = pd.DataFrame.from_dict(results_per_iter)
df_iter
Out[15]:
iteration train_loss test_loss test_accuracy
0 1 2.317132 2.323152 15.12
1 2 2.248185 2.254529 21.19
2 3 2.133389 2.133171 24.72
3 4 2.050330 2.056600 19.90
4 5 1.897787 1.895859 38.00
... ... ... ... ...
17995 5996 0.275367 0.283664 92.21
17996 5997 0.274989 0.283700 92.22
17997 5998 0.275292 0.284007 92.24
17998 5999 0.273485 0.282479 92.30
17999 6000 0.274945 0.284541 92.21

18000 rows × 4 columns

In [104]:
#Plot the accuracy on all test data for every n iteration
import plotly
import plotly.express as px
import plotly.graph_objects as go

fig = px.line()
fig.update_layout(template = 'plotly_dark',legend=dict(title = 'Select epoch:', 
    yanchor="top",
    y=0.25,
    xanchor="left",
    x=0.85), title = 'Test accuracy for every iteration')

fig.update_xaxes(title_text='Iterations')
fig.update_yaxes(title_text='Test accuracy')

subop = {'Epoch 1': df_iter[ 'test_accuracy'][0:6000],
         'Epoch 2': df_iter[ 'test_accuracy'][6000:12000],
         'Epoch 3': df_iter[ 'test_accuracy'][12000:18000] }

for k, v in subop.items():
    fig.add_scatter(x=v.index, y = v, name = k )

fig.show()
In [17]:
df_epoch = pd.DataFrame.from_dict(results_per_epoch)
df_epoch
Out[17]:
epoch train_loss test_loss test_accuracy
0 1 0.298304 0.295644 91.57
1 2 0.278928 0.287999 92.00
2 3 0.274945 0.284541 92.21

2.)¶

For each class, visualize the 10 images that are misclassified with the highest score along with their predicted label and score. These are very confident wrong predictions.

Answer:

In [34]:
#extract misclassified labels 
misc = []
a = np.argmax(Y_test, axis=0)
b = np.argmax(dict_f["l1"], axis=0)

for i in range(len(a)):
    if(a[i] !=b[i] ): 
        a_prob = dict_f["l1"][:, i][a[i]] #probability for actual 
        p_prob = np.amax(dict_f["l1"][:, i]) #probability for predicted 
        misc_i ={'i_a':i , 'a_label': a[i], 'a_prob':a_prob ,  'p_label': b[i], 'p_prob': p_prob}
        misc.append(misc_i) 
misc_df = pd.DataFrame.from_dict(misc)
misc_df
Out[34]:
i a_label a_prob p_label p_prob
0 2 5 0.180325 3 0.685709
1 15 8 0.117822 6 0.731380
2 26 5 0.211179 3 0.786947
3 28 4 0.014444 1 0.891716
4 31 5 0.084143 8 0.842130
... ... ... ... ... ...
774 9973 8 0.320126 9 0.555686
775 9977 2 0.106816 8 0.766568
776 9979 3 0.447184 0 0.487849
777 9986 8 0.046612 0 0.947382
778 9990 8 0.169839 4 0.398347

779 rows × 5 columns

In [26]:
#Top 1 misclassified per category 
max_misc = misc_df.groupby('a_label')['p_prob'].max()
#print(max_misc)
max_misc_df = misc_df[misc_df['p_prob'].isin(max_misc)].sort_values('a_label')
print(max_misc_df)
        i  a_label    a_prob  p_label    p_prob
203  2716        0  0.006809        6  0.955316
700  9046        1  0.006577        6  0.976817
678  8729        2  0.000492        7  0.999123
106  1294        3  0.002487        2  0.997257
144  1948        4  0.009762        6  0.987989
254  3385        5  0.000138        6  0.999172
549  7204        6  0.004092        0  0.995046
671  8647        7  0.000739        2  0.998289
202  2712        8  0.000069        4  0.998524
529  6976        9  0.004846        4  0.994071
In [63]:
#Top 10 misclassified images per predicted category 
##group by predicted label and sort predicted label scores 
max_misc = misc_df.groupby(['p_label']).apply(lambda x: x.sort_values(['p_prob'], ascending = False))
max_misc.reset_index(drop = True, inplace = True)
##keep only top 10 scores  
max_misc = max_misc.groupby('p_label').head(10)
max_misc['p_prob']=max_misc['p_prob'].round(4)
max_misc
Out[63]:
i a_label a_prob p_label p_prob
0 4729 5 0.000057 0 0.9982
1 7204 6 0.004092 0 0.9950
2 9592 6 0.012235 0 0.9877
3 1208 9 0.000834 0 0.9876
4 3357 4 0.000004 0 0.9863
... ... ... ... ... ...
708 594 4 0.040770 9 0.9021
709 7264 4 0.055380 9 0.8927
710 2325 4 0.085306 9 0.8855
711 4310 2 0.003603 9 0.8806
712 4367 4 0.123009 9 0.8526

100 rows × 5 columns

The following shows, for each class $i$, the images that are misclassified as class $i$; the corresponding score appears on the left side of each image.

In [105]:
#visualize number that are highly misclassified 
import matplotlib.pyplot as plt
# create figure
fig = plt.figure(figsize=(20, 20))

rows = 10
columns = 10

for n in range(100):
    # Adds a subplot at the 1st position
    fig.add_subplot(rows, columns, n+1)

    ind = int(max_misc.iloc[n]['i'])
    im = X_test[:, ind].reshape((28,28),order='F')
    plt.imshow(im)
    #plt.suptitle('Actual Label:'+str(max_misc_df.loc[max_misc_df['i'] == ind, 'a_label'].iloc[0]))
    plt.title('Predicted Label: ' +str(max_misc.loc[max_misc['i'] == ind, 'p_label'].iloc[0]))
    plt.ylabel('Score: '+ str(max_misc.loc[max_misc['i'] == ind, 'p_prob'].iloc[0]))
    plt.tick_params(left = False, right = False , labelleft = False ,
                labelbottom = False, bottom = False)
    #plt.show()

3.)¶

Please reduce the number of training data to 1 example per class (chosen randomly from training data) and plot the curve (accuracy vs. iterations). The whole training data will be 10 images only.

In [70]:
#convert to dataframe 
df = pd.DataFrame(labels_train, columns =['train_label']) 
df['index'] = df.index
In [71]:
#random sampling 
size = 1        # sample size
replace = True  # with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]
df_rand = df.groupby('train_label', as_index=False).apply(fn)
df_rand
Out[71]:
train_label index
0 43577 0 43577
1 36932 1 36932
2 9841 2 9841
3 53134 3 53134
4 11712 4 11712
5 1615 5 1615
6 6165 6 6165
7 59774 7 59774
8 27124 8 27124
9 48915 9 48915
In [72]:
#use random indices to filter original training data 
X_train_rs = np.array([X_train[:, index] for index in df_rand['index']])
X_train_rs = X_train_rs.T
X_train_rs.shape
Out[72]:
(784, 10)
In [73]:
Y_train_rs = np.array([Y_train[:, index] for index in df_rand['index']])
Y_train_rs = Y_train_rs.T
Y_train_rs.shape
Out[73]:
(10, 10)
In [74]:
#check filter result 
labels_train[59255]
Out[74]:
array([0], dtype=uint8)
In [75]:
Y_train_rs
Out[75]:
array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
In [81]:
%%time
#Train 
results_per_iter3, results_per_epoch3, dict_f3, dict_b3, params3 = mini_batch_fc(X_train_rs, Y_train_rs, X_test, Y_test,  input_size = 28*28, output_size = 10,  batch_size = 10, epoch = 3, rand_sample = True)
*** Epoch 0 ***
Training done!
*** Epoch 1 ***
Training done!
*** Epoch 2 ***
Training done!
CPU times: total: 250 ms
Wall time: 64.8 ms
In [82]:
df_iter1 = pd.DataFrame.from_dict(results_per_iter3)
df_iter1
Out[82]:
iteration train_loss test_loss test_accuracy
0 1 1.791778 2.216344 17.20
1 1 1.379278 2.114741 25.36
2 1 1.075647 2.036218 31.57
In [106]:
#Plot results 
fig = px.line()
fig.update_layout(template = 'plotly_dark',legend=dict(title = 'Select epoch:', 
    yanchor="top",
    y=0.25,
    xanchor="left",
    x=0.85), title = 'Test accuracy for every iteration')

fig.update_xaxes(title_text='Iterations')
fig.update_yaxes(title_text='Test accuracy')

subop = {'Epoch 1': df_iter1[ 'test_accuracy'][0:1],
         'Epoch 2': df_iter1[ 'test_accuracy'][1:2],
         'Epoch 3': df_iter1[ 'test_accuracy'][2:3]}

for k, v in subop.items():
    fig.add_scatter(x=v.index, y = v, name = k )

fig.show()

Note: Since the whole training set has only 10 instances, with a batch size of 10 there is only 1 iteration per epoch.

4.)¶

Try different mini-batch sizes (1, 10, 100) for the original case and plot the results. Which one is better and why?

Answer:

Since I already obtained results for batch size 10 in part (1), I only run batch sizes 1 and 100 below.

Batch size = 1

To save time, I only run this for 2 epochs.

In [116]:
%%time
#Train batch size=1
results_per_ite4a, results_per_epoch4a, dict_f4a, dict_b4a, params4a = mini_batch_fc(X_train, Y_train, X_test, Y_test, input_size = 28*28, output_size = 10, batch_size = 1, epoch = 2, rand_sample = True)
*** Epoch 0 ***
Training 10000: training loss = 0.9546637752152007, test loss = 0.9358143878755834, test accuracy = 84.76 
Training 20000: training loss = 0.8806656767540756, test loss = 0.8741450320052128, test accuracy = 86.17 
Training 30000: training loss = 0.7634004504789411, test loss = 0.8054683434113198, test accuracy = 88.58 
Training 40000: training loss = 0.8282999818547326, test loss = 0.8749206799428325, test accuracy = 87.66000000000001 
Training 50000: training loss = 0.762895691864133, test loss = 0.8022676882358395, test accuracy = 88.53999999999999 
Training 60000: training loss = 0.8669361741589076, test loss = 0.8844366358183816, test accuracy = 87.94999999999999 
Training done!
*** Epoch 1 ***
Training 10000: training loss = 0.8185169450418242, test loss = 0.8519542665320383, test accuracy = 88.58 
Training 20000: training loss = 0.9537208859597216, test loss = 1.0576902741717993, test accuracy = 85.28 
Training 30000: training loss = 0.7452535705334342, test loss = 0.7561818257201999, test accuracy = 89.12 
Training 40000: training loss = 0.6526562792931612, test loss = 0.7058862962016057, test accuracy = 90.79 
Training 50000: training loss = 0.8486276846957119, test loss = 0.9006482851497486, test accuracy = 88.17 
Training 60000: training loss = 0.801947417161944, test loss = 0.8802016562949941, test accuracy = 89.21 
Training done!
CPU times: total: 9h 18min 9s
Wall time: 1h 48min 12s
In [118]:
df_iter2 = pd.DataFrame.from_dict(results_per_ite4a)
df_iter2
Out[118]:
iteration train_loss test_loss test_accuracy
0 1 3.121829 3.155109 10.32
1 2 2.708463 2.726815 10.24
2 3 3.816171 3.855504 11.35
3 4 4.352786 4.410227 10.45
4 5 3.177153 3.201579 22.85
... ... ... ... ...
119995 59996 0.801954 0.880207 89.21
119996 59997 0.801947 0.880202 89.21
119997 59998 0.801947 0.880202 89.21
119998 59999 0.801947 0.880202 89.21
119999 60000 0.801947 0.880202 89.21

120000 rows × 4 columns

In [119]:
#Plot results 
fig = px.line()
fig.update_layout(template = 'plotly_dark',legend=dict(title = 'Select epoch:', 
    yanchor="top",
    y=0.25,
    xanchor="left",
    x=0.85), title = 'Test accuracy for every iteration')

fig.update_xaxes(title_text='Iterations')
fig.update_yaxes(title_text='Test accuracy')

subop = {'Epoch 1': df_iter2[ 'test_accuracy'][0:60000],
         'Epoch 2': df_iter2[ 'test_accuracy'][60000:120000]}

for k, v in subop.items():
    fig.add_scatter(x=v.index, y = v, name = k )

fig.show()

Batch size = 100

In [107]:
%%time
#Train batch size=100
results_per_iter4b, results_per_epoch4b, dict_f4b, dict_b4b, params4b  = mini_batch_fc(X_train, Y_train, X_test, Y_test, input_size = 28*28,  output_size = 10, batch_size = 100, epoch = 3, rand_sample = True)
*** Epoch 0 ***
Training 100: training loss = 0.6148466417513304, test loss = 0.5910929348083898, test accuracy = 86.67 
Training 200: training loss = 0.4925443093858635, test loss = 0.4700176354280208, test accuracy = 88.5 
Training 300: training loss = 0.44590489595391625, test loss = 0.42565624646282363, test accuracy = 88.88000000000001 
Training 400: training loss = 0.41421927993752855, test loss = 0.3937721639997466, test accuracy = 89.8 
Training 500: training loss = 0.3956477517837601, test loss = 0.3776185653570958, test accuracy = 89.77000000000001 
Training 600: training loss = 0.38068133522531516, test loss = 0.3629114440229983, test accuracy = 90.24 
Training done!
*** Epoch 1 ***
Training 100: training loss = 0.36966974208364933, test loss = 0.35288598561614426, test accuracy = 90.64999999999999 
Training 200: training loss = 0.3619074916867618, test loss = 0.34547338335220235, test accuracy = 90.77 
Training 300: training loss = 0.3551972133200988, test loss = 0.3395805130693344, test accuracy = 90.86999999999999 
Training 400: training loss = 0.34876157930517415, test loss = 0.33277139980258014, test accuracy = 91.09 
Training 500: training loss = 0.343236119839324, test loss = 0.32840843255849345, test accuracy = 91.3 
Training 600: training loss = 0.3384802777415614, test loss = 0.3237241705498349, test accuracy = 91.2 
Training done!
*** Epoch 2 ***
Training 100: training loss = 0.3362124692184101, test loss = 0.3226848908023063, test accuracy = 91.07 
Training 200: training loss = 0.33184906121395225, test loss = 0.31771433777198294, test accuracy = 91.14 
Training 300: training loss = 0.32795963442747733, test loss = 0.31464079397526273, test accuracy = 91.47 
Training 400: training loss = 0.32572288253606196, test loss = 0.31232695850658476, test accuracy = 91.53 
Training 500: training loss = 0.3225478509666107, test loss = 0.3096462537886367, test accuracy = 91.53999999999999 
Training 600: training loss = 0.3199326515325162, test loss = 0.30773874704328297, test accuracy = 91.55 
Training done!
CPU times: total: 8min 20s
Wall time: 1min 46s
In [108]:
df_iter3 = pd.DataFrame.from_dict(results_per_iter4b)
df_iter3
Out[108]:
iteration train_loss test_loss test_accuracy
0 1 2.230907 2.235815 18.61
1 2 2.125183 2.129207 27.80
2 3 2.041743 2.044509 37.72
3 4 1.955124 1.956896 42.51
4 5 1.879467 1.877801 53.50
... ... ... ... ...
1795 596 0.319732 0.307941 91.41
1796 597 0.319625 0.307548 91.57
1797 598 0.319943 0.307623 91.45
1798 599 0.319669 0.307397 91.53
1799 600 0.319933 0.307739 91.55

1800 rows × 4 columns

In [109]:
#Plot results 
fig = px.line()
fig.update_layout(template = 'plotly_dark',legend=dict(title = 'Select epoch:', 
    yanchor="top",
    y=0.25,
    xanchor="left",
    x=0.85), title = 'Test accuracy for every iteration')

fig.update_xaxes(title_text='Iterations')
fig.update_yaxes(title_text='Test accuracy')

subop = {'Epoch 1': df_iter3['test_accuracy'][0:600],
         'Epoch 2': df_iter3['test_accuracy'][600:1200],
         'Epoch 3': df_iter3['test_accuracy'][1200:1800]}

for k, v in subop.items():
    fig.add_scatter(x=v.index, y = v, name = k )

fig.show()
In [110]:
df_epoch4b = pd.DataFrame.from_dict(results_per_epoch4b)
df_epoch4b
Out[110]:
epoch train_loss test_loss test_accuracy
0 1 0.380681 0.362911 90.24
1 2 0.338480 0.323724 91.20
2 3 0.319933 0.307739 91.55

Observation(s):

Based on the results I obtained, batch size 10 appears to work best: it produces higher test accuracy at each epoch than batch size 100, even though it is slightly slower. Batch size 1 yields lower accuracy and is much slower than batch sizes 10 and 100. One possible reason is that the larger batch size (100) takes fewer, coarser steps toward the optimal solution, so it is less likely to converge within the same number of epochs than the smaller batch size (10). Although batch size 10 oscillates more than batch size 100 (its gradient estimates are noisier), this noise can help escape local minima, which is often observed in non-convex problems. With batch size 1, there is so much noise that convergence becomes unlikely.
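The claim that smaller batches give noisier gradient estimates can be illustrated with a toy experiment (a sketch on a synthetic least-squares problem, not the MNIST setup above): estimate the full-batch gradient from mini-batches of different sizes and measure how far the estimates scatter around the true gradient.

```python
import numpy as np

np.random.seed(0)

# Toy least-squares problem: loss(w) = mean((X w - y)^2) over 1000 points.
N, d = 1000, 5
X = np.random.randn(N, d)
w_true = np.random.randn(d)
y = X @ w_true + 0.5 * np.random.randn(N)
w = np.zeros(d)  # evaluate gradients at an arbitrary point

def batch_grad(idx):
    """Gradient of the squared error averaged over the rows in idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

def grad_noise(batch_size, trials=500):
    """Mean distance of mini-batch gradient estimates from the full gradient."""
    full = batch_grad(np.arange(N))
    devs = [np.linalg.norm(batch_grad(np.random.choice(N, batch_size, replace=False)) - full)
            for _ in range(trials)]
    return np.mean(devs)

noise = {b: grad_noise(b) for b in (1, 10, 100)}
print(noise)  # noise shrinks roughly like 1/sqrt(batch_size)
```

The spread shrinks as the batch grows, which matches the observation: batch size 100 takes smooth but coarse steps, batch size 1 takes very noisy ones, and batch size 10 sits in between.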

5.)¶

Instead of using random sampling, sort the data before training so that all "1"s appear before "2"s and so on. Then sample sequentially when running SGD instead of sampling randomly. Does this work well? Why?

In [84]:
#Sort labels_train 
df = pd.DataFrame(labels_train, columns =['train_label']) 
df['index'] = df.index
df_sorted = df.sort_values('train_label')
df_sorted
Out[84]:
train_label index
38142 0 38142
44820 0 44820
44815 0 44815
8331 0 8331
8330 0 8330
... ... ...
54922 9 54922
33204 9 33204
18688 9 18688
28145 9 28145
38140 9 38140

60000 rows × 2 columns

In [85]:
#use sorted indices for images 
X_train_sorted= np.array([X_train[:, index] for index in df_sorted['index']])
X_train_sorted = X_train_sorted.T
X_train_sorted.shape
Out[85]:
(784, 60000)
In [86]:
#use sorted indices for labels  
Y_train_sorted = np.array([Y_train[:, index] for index in df_sorted['index']])
Y_train_sorted = Y_train_sorted.T
Y_train_sorted.shape
Out[86]:
(10, 60000)
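For reference, the same sorted arrays can be obtained more directly with `np.argsort`, avoiding the intermediate DataFrame (a sketch using tiny stand-in data with the notebook's layout of one sample per column; the names below are illustrative):

```python
import numpy as np

# Tiny stand-in data with the same layout as the notebook:
# columns are samples, labels_train[i] is the class of column i.
labels_train = np.array([3, 1, 2, 1, 0])
X_train = np.arange(10).reshape(2, 5)        # (features, samples)
Y_train = np.eye(4)[labels_train].T          # (classes, samples) one-hot

order = np.argsort(labels_train, kind="stable")  # indices that sort the labels
X_train_sorted = X_train[:, order]
Y_train_sorted = Y_train[:, order]

print(labels_train[order])  # [0 1 1 2 3]
```

Fancy indexing with `order` reorders the columns in one step, so no transpose round-trip is needed.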
In [96]:
Y_train_sorted
Out[96]:
array([[1., 1., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 1., 1.]])
In [97]:
%%time
#Train 
results_per_iter5, results_per_epoch5, dict_f5, dict_b5, params5 =mini_batch_fc(X_train_sorted, Y_train_sorted, X_test, Y_test, input_size = 28*28, output_size = 10,  batch_size = 10, epoch = 3, rand_sample = False)
*** Epoch 0 ***
Training 1000: training loss = 7.6560136185957965, test loss = 7.784962497036138, test accuracy = 19.82 
Training 2000: training loss = 6.823384971637918, test loss = 6.962241353393071, test accuracy = 10.26 
Training 3000: training loss = 8.046814939336956, test loss = 8.180553752444181, test accuracy = 9.86 
Training 4000: training loss = 6.964878336372151, test loss = 7.144777356797537, test accuracy = 9.610000000000001 
Training 5000: training loss = 6.004155071774029, test loss = 6.123391019369141, test accuracy = 10.059999999999999 
Training 6000: training loss = 7.2011510893257045, test loss = 7.324521361934094, test accuracy = 10.09 
Training done!
*** Epoch 1 ***
Training 1000: training loss = 4.877579851034547, test loss = 4.945013013043592, test accuracy = 21.349999999999998 
Training 2000: training loss = 4.7513757967582615, test loss = 4.8532523920743085, test accuracy = 12.809999999999999 
Training 3000: training loss = 5.792010286794085, test loss = 5.8955493727586035, test accuracy = 18.3 
Training 4000: training loss = 4.952379765511491, test loss = 5.10419580281234, test accuracy = 12.280000000000001 
Training 5000: training loss = 5.075148569930429, test loss = 5.16851943637267, test accuracy = 16.189999999999998 
Training 6000: training loss = 5.849866551038217, test loss = 5.960995197913896, test accuracy = 10.530000000000001 
Training done!
*** Epoch 2 ***
Training 1000: training loss = 4.006515437987527, test loss = 4.0482843741270615, test accuracy = 26.810000000000002 
Training 2000: training loss = 4.105667161852119, test loss = 4.195502118034864, test accuracy = 16.39 
Training 3000: training loss = 5.24883062562635, test loss = 5.347894459254287, test accuracy = 23.880000000000003 
Training 4000: training loss = 4.339424038936531, test loss = 4.478378353012093, test accuracy = 18.38 
Training 5000: training loss = 4.765526663587499, test loss = 4.855919003890725, test accuracy = 19.3 
Training 6000: training loss = 5.398176060691625, test loss = 5.505538818901623, test accuracy = 13.91 
Training done!
CPU times: total: 1h 25min 1s
Wall time: 16min 13s
In [98]:
df_iter5 = pd.DataFrame.from_dict(results_per_iter5)
df_iter5
Out[98]:
iteration train_loss test_loss test_accuracy
0 1 3.957248 4.041774 9.80
1 2 4.027062 4.113994 9.80
2 3 4.100136 4.189165 9.80
3 4 4.120589 4.210263 9.80
4 5 4.167508 4.258917 9.80
... ... ... ... ...
17995 5996 5.395582 5.502895 13.92
17996 5997 5.395785 5.503103 13.92
17997 5998 5.396005 5.503328 13.92
17998 5999 5.396370 5.503697 13.92
17999 6000 5.398176 5.505539 13.91

18000 rows × 4 columns

In [99]:
#Plot results 
fig = px.line()
fig.update_layout(template = 'plotly_dark',legend=dict(title = 'Select epoch:', 
    yanchor="top",
    y=0.85,
    xanchor="left",
    x=0.95), title = 'Test accuracy for every iteration')

fig.update_xaxes(title_text='Iterations')
fig.update_yaxes(title_text='Test accuracy')

subop = {'Epoch 1': df_iter5['test_accuracy'][0:6000],
         'Epoch 2': df_iter5['test_accuracy'][6000:12000],
         'Epoch 3': df_iter5['test_accuracy'][12000:18000]}

for k, v in subop.items():
    fig.add_scatter(x=v.index, y = v, name = k )

fig.show()

Observation(s):

Based on the results above, training on sorted data does not work well because each batch contains instances from the same class. Shuffling the data reduces the variance of the updates and helps the model generalize. Here, since nearly every batch contains instances of a single class, a training batch does not represent the overall distribution of the data; the gradients are therefore biased and the accuracy suffers. Moreover, if a given data point (say point 11) always follows point 10 in every epoch, its gradient update is biased by the update that point 10 just produced. Shuffling makes the updates from individual points or batches closer to independent.
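The claim that sorted batches are essentially single-class can be checked directly (a toy sketch with synthetic labels, not the MNIST arrays): count the number of distinct classes per consecutive mini-batch under sorted versus shuffled orderings.

```python
import numpy as np

np.random.seed(0)
labels = np.repeat(np.arange(10), 100)   # 1000 labels in sorted order, 100 per class
batch_size = 10

def classes_per_batch(lab):
    """Average number of distinct classes in each consecutive mini-batch."""
    batches = lab.reshape(-1, batch_size)
    return np.mean([len(np.unique(b)) for b in batches])

sorted_diversity = classes_per_batch(labels)                          # sorted order
shuffled_diversity = classes_per_batch(np.random.permutation(labels)) # shuffled

print(sorted_diversity, shuffled_diversity)  # sorted batches hold a single class
```

With sorted labels every batch sees exactly one class, while shuffled batches see several, which is why the shuffled gradients better represent the overall data distribution.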

6.)¶

(Bonus point) Add a hidden layer with 11 hidden neurons and ReLU activation function. Then, plot the accuracy curve to see how the accuracy changes.

Note: the bonus question adds two more layers (a linear layer and a ReLU activation layer). If you want to do the bonus, you may want to implement back-propagation by deriving the gradient formula for each layer and hard-coding it in your Python code; you can then multiply the per-layer derivatives together to get the final gradient, as discussed in class. If you go this route, you can reuse the same code for both the bonus and non-bonus parts of the assignment.

Answer:

  • ReLU:
$$f(x) = \max(x, 0)$$

$$f(x) = \begin{cases} x, & \text{if } x \ge 0\\ 0, & \text{otherwise} \end{cases}$$
  • Derivative of ReLU:
$$f'(x) = \begin{cases} 1, & \text{if } x > 0\\ 0, & \text{otherwise} \end{cases}$$
In [111]:
#relu function 
def relu(r):
    return np.maximum(r, 0)

#relu derivative function (indicator of positive inputs)
def d_relu(Z):
    return (Z > 0).astype(float)

#softmax function (shift by the column max for numerical stability)
def softmax(x):
    exp = np.exp(x - x.max(axis=0, keepdims=True))
    return exp / np.sum(exp, axis=0)

#loss function (cross entropy loss)
def loss(Y, Y_pred):  
    log_sum = -np.sum(np.multiply(Y, np.log(Y_pred))) #sum of -gt(i)*log(P(c_i|x))
    m = Y.shape[1]
    L = (1./m) * log_sum #average over the m samples (columns)
    return L

#accuracy function 
def accuracy(labels, predictions): 
    total_correct = 0 
    m = labels.shape[1]
    predictions = np.argmax(predictions, axis=0)
    labels = np.argmax(labels, axis=0)
    for i in range(len(labels)):
        if (predictions[i] == labels[i]):
            total_correct += 1
    return total_correct/m*100 

#forward propagation function 
def forward(X, params):
    dict_f={} #initialize empty dictionary for forward results 
    
    # input layer to l1: Z1 = w1^T*x + b1 
    dict_f['z1'] = np.matmul(params['w1'],X) + params['b1']
    dict_f['l1'] = relu(dict_f['z1']) #using equation (1) above 
    
    # l1 to output layer using softmax: Z2 = w2^T*l1 + b2  
    dict_f['z2'] = np.matmul(params['w2'],dict_f['l1']) + params['b2']
    dict_f['l2']  = softmax(dict_f['z2'] ) #get probabilities of the output using softmax 
     
    return dict_f 

#backward propagation function 
def backward(X,Y,params,dict_f,batch_size):
    
    #intiliaze empty dictionary for results 
    dict_b = {}
    
    #compute derivatives of layer 2 wrt z, w, and b 
    dz2 = dict_f['l2'] - Y 
    dict_b['dw2'] = (1./batch_size) * np.matmul(dz2, dict_f["l1"].T)
    dict_b['db2'] = (1./batch_size) * np.sum(dz2, axis=1, keepdims=True) #bias update 

    #compute derivatives of layer 1 wrt z, w, and b 
    dl1 = np.matmul(params["w2"].T, dz2)
    dz1 = dl1 * d_relu(dict_f["z1"]) 
    dict_b['dw1'] = (1./batch_size) * np.matmul(dz1, X.T)
    dict_b['db1'] = (1./batch_size) * np.sum(dz1, axis=1, keepdims=True)

    return dict_b
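Hand-coded back-propagation like the cell above is easy to get subtly wrong, so a finite-difference gradient check is a useful sanity test. The sketch below is self-contained (it redefines a minimal bias-free version of the same ReLU-then-softmax network rather than reusing the notebook's functions) and compares the analytic gradient of `w2` with a numerical estimate on a tiny random instance:

```python
import numpy as np

np.random.seed(0)

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0)

def net_loss(X, Y, w1, w2):
    """Cross-entropy loss of a tiny ReLU -> softmax network (no biases)."""
    l1 = np.maximum(w1 @ X, 0)
    p = softmax(w2 @ l1)
    return -np.sum(Y * np.log(p)) / X.shape[1]

# Tiny random instance: 4 inputs, 3 hidden units, 2 classes, 5 samples.
X = np.random.randn(4, 5)
Y = np.eye(2)[np.random.randint(0, 2, 5)].T
w1 = np.random.randn(3, 4) * 0.5
w2 = np.random.randn(2, 3) * 0.5

# Analytic gradient for w2 (same formula as dw2 in the backward pass above).
l1 = np.maximum(w1 @ X, 0)
dz2 = softmax(w2 @ l1) - Y
dw2 = dz2 @ l1.T / X.shape[1]

# Numerical gradient by central differences.
eps = 1e-6
num = np.zeros_like(w2)
for i in range(w2.shape[0]):
    for j in range(w2.shape[1]):
        wp, wm = w2.copy(), w2.copy()
        wp[i, j] += eps
        wm[i, j] -= eps
        num[i, j] = (net_loss(X, Y, w1, wp) - net_loss(X, Y, w1, wm)) / (2 * eps)

print(np.max(np.abs(dw2 - num)))
```

A maximum discrepancy on the order of `eps**2` (here far below 1e-6) indicates the analytic formula is consistent with the loss; the same check can be repeated for `w1` and the biases.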
In [112]:
np.random.seed(11)

#initialize hyperparameters
m = X_train.shape[1] #number of training samples
beta = 0.9 #weight parameter for moving average between 0 and 1 (higher to smooth out update) 
learning_rate = 0.1 
epoch = 3 #number of epochs
batch_size = 10 # batch size 
input_size = 28*28 #num of inputs
output_size = 10 #num of outputs = 10 digits 
hidden_size = 11 #hidden size1 
batches = int(X_train.shape[1] / batch_size) #number of batches

#initilialize empty accuracy array
results_per_iter = []
results_per_epoch = []

#initialize parameters
#scale weights by sqrt(1/n_in) so the pre-activations keep unit-order variance
params = { "w1": np.random.randn(hidden_size, input_size) * np.sqrt(1. / input_size),
           "b1": np.zeros((hidden_size, 1)),
           "w2": np.random.randn(output_size, hidden_size) * np.sqrt(1. / hidden_size),
           "b2": np.zeros((output_size, 1)) }

#initialize exponential moving average of gradients as 0
#needed to add momentum 
dw1_v = np.zeros(params["w1"].shape)
db1_v = np.zeros(params["b1"].shape)
dw2_v = np.zeros(params["w2"].shape)
db2_v = np.zeros(params["b2"].shape)
In [113]:
%%time 
#loop through each batch of samples 
np.random.seed(11)

for n in range(epoch):
    print("*** Epoch {} ***".format(n))                                
    #randomly permute column indices (shuffle without replacement)
    indices = np.random.permutation(m)
    X = X_train[:, indices]
    Y = Y_train[:, indices]
    
    #counter 
    count = 0 
    
    #iterate through each batch of 10 samples in m total training samples  
    for i in range(0, m, batch_size):
        
        #assign i-th batch to variables 
        X_i = X[:, i:i+batch_size]
        Y_i = Y[:, i:i+batch_size]
        
        #normalize input betw [-1,1]
        #a = -1 
        #b = 1 
        #X_i = (b-a)* np.divide((X_i-X_i.min()), (X_i.max() - X_i.min()))+ a 
        #Y_i = (b-a)* np.divide((Y_i-Y_i.min()), (Y_i.max() - Y_i.min()))+ a 
        
        #perform forward and backward pass
        dict_f = forward(X_i, params)
        dict_b = backward(X_i, Y_i, params, dict_f, batch_size)

        #update moving average of gradients 
        dw1_v = (beta * dw1_v) + (1. - beta) * dict_b["dw1"]
        db1_v = (beta * db1_v) + (1. - beta) * dict_b["db1"]
        dw2_v = (beta * dw2_v) + (1. - beta) * dict_b["dw2"]
        db2_v = (beta * db2_v) + (1. - beta) * dict_b["db2"]

        #update parameters/weights using GD update rule with fixed learning rate  
        params["w1"] = params["w1"] - learning_rate * dw1_v
        params["b1"] = params["b1"] - learning_rate * db1_v
        params["w2"] = params["w2"] - learning_rate * dw2_v
        params["b2"] = params["b2"] - learning_rate * db2_v

        #train (forward pass over the full training set; expensive every iteration)
        dict_f = forward(X_train, params)
        train_loss = loss(Y_train, dict_f["l2"])

        #test 
        dict_f = forward(X_test, params)
        test_loss = loss(Y_test, dict_f["l2"])
        acc = accuracy(Y_test, dict_f["l2"])
        
        #update counter 
        count += 1 
        
        #save results 
        iter_results = {'iteration': count , 'train_loss':train_loss, 'test_loss': test_loss, 'test_accuracy':acc }
        results_per_iter.append(iter_results) 
        
        #display results 
        if (count%1000 == 0) : 
            print("Training {}: training loss = {}, test loss = {}, test accuracy = {} ".format(count,train_loss, test_loss, acc))
        
    #train 
    dict_f = forward(X_train, params)
    train_loss = loss(Y_train, dict_f["l2"])
    
    #test 
    dict_f = forward(X_test, params)
    test_loss = loss(Y_test, dict_f["l2"])
    acc = accuracy(Y_test, dict_f["l2"])
    
    #save results 
    e = n+1 
    results = {'epoch': e, 'train_loss':train_loss, 'test_loss': test_loss, 'test_accuracy':acc }
    results_per_epoch.append(results)
    
    
print("Training done!")
*** Epoch 0 ***
Training 1000: training loss = 0.38355553341135973, test loss = 0.3724378859108159, test accuracy = 89.01 
Training 2000: training loss = 0.3593388943203853, test loss = 0.36250821527669, test accuracy = 89.19 
Training 3000: training loss = 0.3173773357911038, test loss = 0.3171075609996651, test accuracy = 90.64 
Training 4000: training loss = 0.3480446505682277, test loss = 0.3618698384653955, test accuracy = 89.9 
Training 5000: training loss = 0.28684855488089483, test loss = 0.3016496329769622, test accuracy = 91.23 
Training 6000: training loss = 0.2767988529906571, test loss = 0.28050895592373754, test accuracy = 91.71000000000001 
*** Epoch 1 ***
Training 1000: training loss = 0.35641718682145784, test loss = 0.36632933057348877, test accuracy = 88.64 
Training 2000: training loss = 0.3564540217338831, test loss = 0.3711754956476994, test accuracy = 89.44 
Training 3000: training loss = 0.26416674736248996, test loss = 0.27520556760283676, test accuracy = 91.95 
Training 4000: training loss = 0.2607346357944239, test loss = 0.2787991979786823, test accuracy = 91.84 
Training 5000: training loss = 0.24823455414601278, test loss = 0.2590064420233197, test accuracy = 92.13 
Training 6000: training loss = 0.24339353492441693, test loss = 0.2419777165580345, test accuracy = 92.77 
*** Epoch 2 ***
Training 1000: training loss = 0.27255113699593736, test loss = 0.29599908356886795, test accuracy = 91.53 
Training 2000: training loss = 0.23938062548106737, test loss = 0.24885206657253417, test accuracy = 92.92 
Training 3000: training loss = 0.25739419730428764, test loss = 0.27214245862355396, test accuracy = 92.08 
Training 4000: training loss = 0.26084625394861716, test loss = 0.272947114345603, test accuracy = 92.2 
Training 5000: training loss = 0.240215044139242, test loss = 0.24878243191874272, test accuracy = 92.95 
Training 6000: training loss = 0.24340524266235095, test loss = 0.27626885465156054, test accuracy = 91.89 
Training done!
CPU times: total: 1h 38min 4s
Wall time: 19min 22s
In [114]:
df_iter6 = pd.DataFrame.from_dict(results_per_iter)
df_iter6
Out[114]:
iteration train_loss test_loss test_accuracy
0 1 2.300597 2.301236 9.78
1 2 2.291695 2.292165 10.61
2 3 2.281327 2.281890 12.16
3 4 2.271067 2.271815 13.28
4 5 2.261570 2.262194 14.66
... ... ... ... ...
17995 5996 0.232484 0.264917 92.38
17996 5997 0.235673 0.268251 92.31
17997 5998 0.238533 0.271150 92.17
17998 5999 0.240725 0.273587 91.99
17999 6000 0.243405 0.276269 91.89

18000 rows × 4 columns

In [115]:
#Plot results 
fig = px.line()
fig.update_layout(template = 'plotly_dark',legend=dict(title = 'Select epoch:', 
    yanchor="top",
    y=0.25,
    xanchor="left",
    x=0.90), title = 'Test accuracy for every iteration')

fig.update_xaxes(title_text='Iterations')
fig.update_yaxes(title_text='Test accuracy')

subop = {'Epoch 1': df_iter6['test_accuracy'][0:6000],
         'Epoch 2': df_iter6['test_accuracy'][6000:12000],
         'Epoch 3': df_iter6['test_accuracy'][12000:18000]}

for k, v in subop.items():
    fig.add_scatter(x=v.index, y = v, name = k )

fig.show()